Lesson 3 Tidy data

… in which we explore the concept of Tidy Data and learn more advanced data wrangling techniques

3.1 Recap

  • a speedrun of lecture 1 and 2

3.2 Tidy data

3.2.1 What and why is tidy data?

Figure from https://r4ds.had.co.nz/tidy-data.html Wickham and Grolemund4

palmerpenguins::penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

3.2.2 Make data tidy

with the tidyr package.

“Happy families are all alike; every unhappy family is unhappy in its own way”
— Leo Tolstoy (https://tidyr.tidyverse.org/articles/tidy-data.html)

Let’s make some data tidy!

  • table1
table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

3.2.3 pivot_wider

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table2 %>% 
  pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
  • table2

3.2.4 separate

table3 %>% 
  separate(col = rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
  • table3

3.2.5 pivot_longer

table4a %>% 
  pivot_longer(-country, names_to = "year", values_to = "cases")
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
  • table4a
table4b %>% 
  pivot_longer(-country, names_to = "year", values_to = "population")
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583
clean_wide_data <- function(data, values_column) {
  data %>% 
    pivot_longer(-country, names_to = "year", values_to = values_column)
}

clean4a <- table4a %>% 
  clean_wide_data("cases")
clean4b <- table4b %>% 
  clean_wide_data("population")

3.2.6 left_join

left_join(clean4a, clean4b, by = c("country", "year"))
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583
  • table4a and table4b

3.2.7 unite

  • table5
table5 %>% 
  unite("year", century, year, sep = "") %>% 
  separate(rate, c("cases", "population")) %>% 
  mutate(
    year = parse_number(year),
    cases = parse_number(cases),
    population = parse_number(population)
  )
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table5 %>% 
  unite("year", century, year, sep = "") %>% 
  separate(rate, c("cases", "population")) %>% 
  mutate(
    across(c(year, cases, population), parse_number)
  )
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table5 %>% 
  unite("year", century, year, sep = "") %>% 
  separate(rate, c("cases", "population")) %>% 
  mutate(
    across(-country, parse_number)
  )
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

3.2.8 Another example

  • billboard
  • explicit vs implicit NAs
  • na.omit
billboard %>% 
  pivot_longer(starts_with("wk"), names_to = "week", values_to = "placement") %>% 
  mutate(week = parse_number(week))
## # A tibble: 24,092 × 5
##    artist track                   date.entered  week placement
##    <chr>  <chr>                   <date>       <dbl>     <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1        87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2        82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3        72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26       4        77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26       5        87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26       6        94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26       7        99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26       8        NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26       9        NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26      10        NA
## # … with 24,082 more rows
tidy_bilboard <- billboard %>% 
  pivot_longer(starts_with("wk"),
    names_to = "week",
    values_to = "placement",
    names_prefix = "wk",
    names_transform = list(week = as.integer)
  )
plt <- tidy_bilboard %>% 
  ggplot(aes(week, placement)) +
  geom_point(aes(label = paste(artist, track))) +
  geom_line(aes(group = paste(artist, track)))

plotly::ggplotly(plt)

3.3 More shapes for data

  • omitted:
    • matrices
    • arrays

3.3.1 Lists

c(first = 1, second = 2)
##  first second 
##      1      2
x <- list(first = 1, second = 2, "some text", list(1, 2), 1:5)
x
## $first
## [1] 1
## 
## $second
## [1] 2
## 
## [[3]]
## [1] "some text"
## 
## [[4]]
## [[4]][[1]]
## [1] 1
## 
## [[4]][[2]]
## [1] 2
## 
## 
## [[5]]
## [1] 1 2 3 4 5
palmerpenguins::penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

3.3.2 Nested data

example <- tibble(
  x = 1:3,
  y = list(
    "hello",
    TRUE,
    1:4
  )
)

example
## # A tibble: 3 × 2
##       x y        
##   <int> <list>   
## 1     1 <chr [1]>
## 2     2 <lgl [1]>
## 3     3 <int [4]>
# View(example)
nested <- palmerpenguins::penguins %>% 
  nest(data = -island)

nested
## # A tibble: 3 × 2
##   island    data              
##   <fct>     <list>            
## 1 Torgersen <tibble [52 × 7]> 
## 2 Biscoe    <tibble [168 × 7]>
## 3 Dream     <tibble [124 × 7]>
nested$data[[1]]
## # A tibble: 52 × 7
##    species bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Adelie            39.1          18.7              181        3750 male   2007
##  2 Adelie            39.5          17.4              186        3800 fema…  2007
##  3 Adelie            40.3          18                195        3250 fema…  2007
##  4 Adelie            NA            NA                 NA          NA <NA>   2007
##  5 Adelie            36.7          19.3              193        3450 fema…  2007
##  6 Adelie            39.3          20.6              190        3650 male   2007
##  7 Adelie            38.9          17.8              181        3625 fema…  2007
##  8 Adelie            39.2          19.6              195        4675 male   2007
##  9 Adelie            34.1          18.1              193        3475 <NA>   2007
## 10 Adelie            42            20.2              190        4250 <NA>   2007
## # … with 42 more rows
nested %>% 
  unnest(data)
## # A tibble: 344 × 8
##    island    species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>            <dbl>         <dbl>             <int>       <int>
##  1 Torgersen Adelie            39.1          18.7               181        3750
##  2 Torgersen Adelie            39.5          17.4               186        3800
##  3 Torgersen Adelie            40.3          18                 195        3250
##  4 Torgersen Adelie            NA            NA                  NA          NA
##  5 Torgersen Adelie            36.7          19.3               193        3450
##  6 Torgersen Adelie            39.3          20.6               190        3650
##  7 Torgersen Adelie            38.9          17.8               181        3625
##  8 Torgersen Adelie            39.2          19.6               195        4675
##  9 Torgersen Adelie            34.1          18.1               193        3475
## 10 Torgersen Adelie            42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

3.4 Exercises

3.4.1 Tidy data

3.5 Resources